fix: trim before parsing numbers #9537

Open
aryan-212 wants to merge 1 commit into apache:main from aryan-212:trim-num-str

@aryan-212 commented Mar 11, 2026

Which issue does this PR close?

Rationale for this change

The Parser::parse implementations for numeric types did not trim whitespace before parsing. This caused values like " 42 " or " 1.5 " to fail parsing and return None, even though they represent valid numbers.

What changes are included in this PR?

  • Added .trim() calls before parsing in FloatType Parser implementations.
  • Added string.trim() at the top of the parser_primitive! macro, which covers all integer and duration types.
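
The effect of trimming once at the top of a macro can be sketched with the standard library alone. The trait `SketchParser` and the macro `sketch_parser_primitive!` below are hypothetical stand-ins, not the arrow-cast definitions, and they delegate to `str::parse` rather than to whatever parsing routine arrow-cast actually uses.

```rust
/// Hypothetical stand-in for a parser trait; not the arrow-cast `Parser`.
trait SketchParser: Sized {
    fn parse_trimmed(string: &str) -> Option<Self>;
}

/// Hypothetical macro: a single `trim()` at the top of the generated
/// function covers every type the macro is invoked for.
macro_rules! sketch_parser_primitive {
    ($($t:ty),* $(,)?) => {
        $(
            impl SketchParser for $t {
                fn parse_trimmed(string: &str) -> Option<Self> {
                    // Trim once, before any type-specific parsing.
                    let string = string.trim();
                    // Assumption: std parsing stands in for the real parser.
                    string.parse::<$t>().ok()
                }
            }
        )*
    };
}

sketch_parser_primitive!(i8, i16, i32, i64, u8, u16, u32, u64);
```

Because the trim lives inside the macro body, adding another type to the invocation list picks up the same behaviour without any per-type change.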

Are these changes tested?

Yes. Added test_parse_trimmed_whitespace covering:

  • Float types with leading/trailing spaces and tabs/newlines
  • Signed and unsigned integer types with whitespace
  • Negative integers with whitespace
  • Whitespace-only strings returning None
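
The categories above can be illustrated with a small standard-library sketch. The helpers `parse_strict` and `parse_trimmed` are hypothetical and use `str::parse`, not the arrow-cast parsers, so they only approximate the behaviour under test. One point worth noting is that `str::trim` strips leading and trailing whitespace only, so whitespace inside the number still fails to parse.

```rust
use std::str::FromStr;

/// Hypothetical helper: parse without trimming, as a baseline.
fn parse_strict<T: FromStr>(s: &str) -> Option<T> {
    s.parse::<T>().ok()
}

/// Hypothetical helper: trim leading/trailing whitespace, then parse.
fn parse_trimmed<T: FromStr>(s: &str) -> Option<T> {
    s.trim().parse::<T>().ok()
}
```

With these helpers, `parse_strict::<f64>(" 1.5 ")` is `None` while `parse_trimmed::<f64>(" 1.5 ")` is `Some(1.5)`; tabs and newlines are trimmed as well, and a whitespace-only input trims to the empty string and yields `None`.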

DataFusion changes
For the following SQL:

 SELECT
    substring('Suite 28', 6) AS extracted,
    length(substring('Suite 28', 6)) AS extracted_length,
    CAST(substring('Suite 28', 6) AS INT) AS extracted_int,
    CAST(substring('Suite 28', 6) AS INT) + 1 AS plus_one;

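In DataFusion, this query previously returned the following. The extracted value is ' 28' with a leading space (hence the length of 3), so the cast to INT failed:

extracted  extracted_length  extracted_int  plus_one
28         3                 null           null

With these changes, it returns:

extracted  extracted_length  extracted_int  plus_one
28         3                 28             29

This behaviour now matches Databricks.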
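
The query can be traced step by step in plain Rust. This is only a sketch: `sql_substring`, `cast_int_strict`, and `cast_int_trimmed` are hypothetical helpers built on the standard library, not DataFusion or arrow-cast functions, and the substring helper assumes ASCII input so that byte and character positions coincide.

```rust
/// Hypothetical helper mimicking SQL `substring(s, start)` with a 1-based
/// start position. Assumes ASCII input.
fn sql_substring(s: &str, start: usize) -> &str {
    let begin = start.saturating_sub(1).min(s.len());
    &s[begin..]
}

/// Cast to INT without trimming: surrounding whitespace makes it fail.
fn cast_int_strict(s: &str) -> Option<i32> {
    s.parse().ok()
}

/// Cast to INT after trimming, mirroring the change in this PR.
fn cast_int_trimmed(s: &str) -> Option<i32> {
    s.trim().parse().ok()
}
```

Here `sql_substring("Suite 28", 6)` yields `" 28"` (three bytes, including the leading space). The strict cast returns `None`, which corresponds to the old `null` columns, while the trimmed cast returns `Some(28)`, and adding one gives 29.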

Are there any user-facing changes?

Yes. Numeric parsing now accepts strings with leading/trailing whitespace. This is a relaxation of the previous behaviour (previously None, now Some(value)), so it is not a breaking change.

@tustvold
Contributor

Have you run the benchmarks for this?

@aryan-212
Author

Have you run the benchmarks for this?

Sorry, I'm new here. Could you tell me how to run them? 😅

@Rafferty97
Contributor

Have you run the benchmarks for this?

Sorry, I'm new here. Could you tell me how to run them? 😅

I think "cargo bench -p arrow-cast" should be sufficient.

@alamb
Contributor

alamb commented Mar 17, 2026

run benchmark cast_kernels

@alamb
Contributor

alamb commented Mar 17, 2026

Thank you @Rafferty97 and @aryan-212 -- I kicked off some benchmark runs to verify the performance implications of this change

Contributor

@alamb left a comment

Thank you for this PR @aryan-212 and (especially) for the help reviewing @Rafferty97

 impl Parser for Float16Type {
     fn parse(string: &str) -> Option<f16> {
-        lexical_core::parse(string.as_bytes())
+        lexical_core::parse(string.trim().as_bytes())
Contributor

I wonder if there will be any performance implications 🤔

#[test]
fn test_parse_trimmed_whitespace() {
    // Float types
    assert_eq!(Float16Type::parse(" 1.5 "), Some(f16::from_f32(1.5)));
Contributor

Can you please add some tests with only leading whitespace and some tests with only trailing whitespace?

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Linux bench-c4077394630-385-c4bkf 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing trim-num-str (8fbf1b8) to d3c7900 (merge-base) diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench cast_kernels
BENCH_FILTER=
Results will be posted here when complete

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

group                                                              main                                   trim-num-str
-----                                                              ----                                   ------------
cast binary view to string                                         1.00     59.9±0.57µs        ? ?/sec    1.08     64.4±0.53µs        ? ?/sec
cast binary view to string view                                    1.00     64.8±0.31µs        ? ?/sec    1.00     64.5±0.43µs        ? ?/sec
cast binary view to wide string                                    1.07     63.4±0.58µs        ? ?/sec    1.00     59.2±0.50µs        ? ?/sec
cast date32 to date64 512                                          1.01    343.2±5.68ns        ? ?/sec    1.00    339.8±1.30ns        ? ?/sec
cast date64 to date32 512                                          1.00    411.5±3.73ns        ? ?/sec    1.00    412.0±1.29ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.00    578.8±1.08ns        ? ?/sec    1.01    583.6±2.96ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.01      2.5±0.00µs        ? ?/sec    1.00      2.5±0.00µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00      3.0±0.00µs        ? ?/sec    1.00      3.0±0.00µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     74.7±0.58ns        ? ?/sec    1.01     75.5±0.61ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.00   1824.2±2.62ns        ? ?/sec    1.00   1824.4±1.85ns        ? ?/sec
cast decimal256 to decimal128 512                                  1.00     21.5±0.03µs        ? ?/sec    1.00     21.4±0.02µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.01      6.5±0.02µs        ? ?/sec    1.00      6.5±0.01µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     74.3±0.55ns        ? ?/sec    1.00     74.5±0.42ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00    993.7±0.70ns        ? ?/sec    1.06   1054.6±1.97ns        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.00   1205.0±1.11ns        ? ?/sec    1.05   1260.3±2.02ns        ? ?/sec
cast decimal32 to decimal64 512                                    1.00    363.3±1.43ns        ? ?/sec    1.01    365.2±2.29ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.00      2.4±0.00µs        ? ?/sec    1.01      2.4±0.00µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.00    327.9±0.56ns        ? ?/sec    1.02    333.2±2.23ns        ? ?/sec
cast dict to string view                                           1.04     42.1±1.29µs        ? ?/sec    1.00     40.6±1.74µs        ? ?/sec
cast f32 to string 512                                             1.00     11.9±0.04µs        ? ?/sec    1.00     11.8±0.07µs        ? ?/sec
cast f64 to string 512                                             1.00     15.6±0.04µs        ? ?/sec    1.00     15.6±0.06µs        ? ?/sec
cast float32 to int32 512                                          1.00   1400.0±5.85ns        ? ?/sec    1.01   1407.8±8.22ns        ? ?/sec
cast float64 to float32 512                                        1.00    687.2±2.99ns        ? ?/sec    1.03    709.3±2.89ns        ? ?/sec
cast float64 to uint64 512                                         1.00  1474.1±10.57ns        ? ?/sec    1.00  1474.7±13.82ns        ? ?/sec
cast i64 to string 512                                             1.00      9.3±0.05µs        ? ?/sec    1.00      9.2±0.04µs        ? ?/sec
cast int32 to float32 512                                          1.00    710.7±2.85ns        ? ?/sec    1.00    708.8±6.04ns        ? ?/sec
cast int32 to float64 512                                          1.00    704.3±3.35ns        ? ?/sec    1.03    726.7±4.57ns        ? ?/sec
cast int32 to int32 512                                            1.03    177.4±6.40ns        ? ?/sec    1.00    172.9±1.12ns        ? ?/sec
cast int32 to int64 512                                            1.00    726.6±2.91ns        ? ?/sec    1.03    744.8±8.87ns        ? ?/sec
cast int32 to uint32 512                                           1.00   1396.5±1.18ns        ? ?/sec    1.00   1397.5±3.32ns        ? ?/sec
cast int64 to int32 512                                            1.00   1489.5±1.62ns        ? ?/sec    1.01   1497.6±6.69ns        ? ?/sec
cast no runs of int32s to ree<int32>                               1.01     57.5±1.17µs        ? ?/sec    1.00     57.2±1.74µs        ? ?/sec
cast runs of 10 string to ree<int32>                               1.00      8.7±0.09µs        ? ?/sec    1.00      8.7±0.06µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             1.00      3.4±0.01µs        ? ?/sec    1.00      3.4±0.01µs        ? ?/sec
cast string single run to ree<int32>                               1.00     27.3±0.02µs        ? ?/sec    1.00     27.3±0.02µs        ? ?/sec
cast string to binary view 512                                     1.01      2.4±0.01µs        ? ?/sec    1.00      2.4±0.01µs        ? ?/sec
cast string view to binary view                                    1.00     72.9±0.84ns        ? ?/sec    1.00     73.3±2.10ns        ? ?/sec
cast string view to dict                                           1.01    175.7±1.04µs        ? ?/sec    1.00    174.4±0.67µs        ? ?/sec
cast string view to string                                         1.00     39.6±1.53µs        ? ?/sec    1.00     39.4±1.21µs        ? ?/sec
cast string view to wide string                                    1.00     39.7±1.39µs        ? ?/sec    1.00     39.8±1.62µs        ? ?/sec
cast time32s to time32ms 512                                       1.08    160.2±4.02ns        ? ?/sec    1.00    148.6±0.79ns        ? ?/sec
cast time32s to time64us 512                                       1.00    337.9±0.66ns        ? ?/sec    1.00    339.4±1.36ns        ? ?/sec
cast time64ns to time32s 512                                       1.02    420.2±0.69ns        ? ?/sec    1.00    411.9±0.99ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    255.1±3.19ns        ? ?/sec    1.00    254.7±1.70ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.00   1873.5±3.06ns        ? ?/sec    1.00   1876.5±8.00ns        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.01    171.9±2.06ns        ? ?/sec    1.00    170.9±1.24ns        ? ?/sec
cast utf8 to date32 512                                            1.02      6.6±0.03µs        ? ?/sec    1.00      6.5±0.03µs        ? ?/sec
cast utf8 to date64 512                                            1.00     32.1±0.09µs        ? ?/sec    1.00     32.0±0.11µs        ? ?/sec
cast utf8 to f32                                                   1.00      5.7±0.04µs        ? ?/sec    1.16      6.6±0.03µs        ? ?/sec
cast wide string to binary view 512                                1.02      4.2±0.28µs        ? ?/sec    1.00      4.1±0.08µs        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 491.9s
Peak memory 2.1 GiB
Avg memory 2.1 GiB
CPU user 490.6s
CPU sys 1.1s
Disk read 0 B
Disk write 1.2 GiB

branch

Metric Value
Wall time 488.7s
Peak memory 2.1 GiB
Avg memory 2.1 GiB
CPU user 488.5s
CPU sys 0.2s
Disk read 0 B
Disk write 3.8 MiB

Development

Successfully merging this pull request may close these issues.

arrow-cast numeric parsers fail to parse whitespace-padded strings

5 participants